Project Description:

A project from the object recognition domain.

Context:

The features were extracted from the silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS, which computes a combination of scale-independent features, mixing classical moment-based measures (scaled variance, skewness and kurtosis about the major/minor axes) with heuristic measures (hollows, circularity, rectangularity and compactness).

Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

Data Description:

The columns of the dataset can be divided into two parts: the independent attributes and the dependent attribute (class). The independent attributes are:

compactness

circularity

distance_circularity

radius_ratio

pr.axis_aspect_ratio

max.length_aspect_ratio

scatter_ratio

elongatedness

pr.axis_rectangularity

max.length_rectangularity

scaled_variance

scaled_variance.1

scaled_radius_of_gyration

scaled_radius_of_gyration.1

skewness_about

skewness_about.1

skewness_about.2

hollows_ratio

All of the above attributes are numerical.

class: Four 'Corgi' models were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. The expectation was that the bus, the van and either of the cars would be easy to distinguish, but it turned out to be very difficult to distinguish between the two cars. So, as it stands, there are three major classes of vehicle in the dataset: bus, van and car.

Objective:

Apply a dimensionality reduction technique (PCA) and train a model on the principal components instead of the raw data, to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette viewed at one of many different angles.

Import necessary libraries

1. Load the dataset
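A hedged sketch of this step. The filename and the stand-in values below are assumptions, not the real data; the actual file yields 846 rows and 19 columns, so a tiny illustrative frame is built in its place to keep the sketch self-contained:

```python
import pandas as pd

# In the notebook this would read the real file, e.g.:
# vehicle_data = pd.read_csv("vehicle.csv")   # filename is an assumption
# Self-contained stand-in with the same kind of structure:
vehicle_data = pd.DataFrame({
    "compactness": [95, 91, 104, 93, 85],
    "circularity": [48, 41, 50, 41, 44],
    "class": ["van", "van", "car", "van", "bus"],
})
print(vehicle_data.head())   # first five entries
print(vehicle_data.shape)    # (rows, columns)
```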

2. Pre-processing of the dataset.

First five entries of the dataset

Shape of the data

The two-dimensional dataframe vehicle_data consists of 846 rows and 19 columns.

Data type of each attribute

All the attributes apart from class contain numerical values. Since class is categorical rather than a generic object, we convert it to the category dtype:
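A minimal sketch of the conversion, on a stand-in frame (the values are illustrative):

```python
import pandas as pd

# Stand-in for the loaded vehicle_data frame
df = pd.DataFrame({"class": ["car", "bus", "van", "car"]})
df["class"] = df["class"].astype("category")   # treat the target as categorical
print(df["class"].dtype)
```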

5-point summary of the numerical attributes

The numerical attributes can be summarised in the following manner:

compactness: There are 846 records in the attribute with a mean value of 93.678487. The value ranges from 73 to 119. 25% of the records have values below 87, 50% below 93 and 75% below 100. The observations differ from the mean by 8.234474

circularity: There are 841 records in the column with a mean value of 44.828775. The value ranges from 33 to 59. 25% of the records have values below 40, 50% below 44 and 75% below 49. The observations differ from the mean by 6.152172

distance_circularity: There are 842 records in the column with a mean value of 82.110451. The value ranges from 40 to 112. 25% of the records have values below 70, 50% below 80 and 75% below 98. The observations differ from the mean by 15.778292

radius_ratio: There are 840 records in the column with a mean value of 168.888095. The value ranges from 104 to 333. 25% of the records have values below 141, 50% below 167 and 75% below 195. The observations differ from the mean by 33.520198

pr.axis_aspect_ratio: There are 844 records in the column with a mean value of 61.67891. The value ranges from 47 to 138. 25% of the records have values below 57, 50% below 61 and 75% below 65. The observations differ from the mean by 7.891463

max.length_aspect_ratio: There are 846 records in the column with a mean value of 8.567376. The value ranges from 2 to 55. 25% of the records have values below 7, 50% below 8 and 75% below 10. The observations differ from the mean by 4.601217

scatter_ratio: There are 845 records in the column with a mean value of 168.901775. The value ranges from 112 to 265. 25% of the records have values below 147, 50% below 157 and 75% below 198. The observations differ from the mean by 33.214848

elongatedness: There are 845 records in the column with a mean value of 40.933728. The value ranges from 26 to 61. 25% of the records have values below 33, 50% below 43 and 75% below 46. The observations differ from the mean by 7.816186

pr.axis_rectangularity: There are 843 records in the column with a mean value of 20.582444. The value ranges from 17 to 29. 25% of the records have values below 19, 50% below 20 and 75% below 23. The observations differ from the mean by 2.592933

max.length_rectangularity: There are 846 records in the column with a mean value of 147.998818. The value ranges from 118 to 188. 25% of the records have values below 137, 50% below 146 and 75% below 159. The observations differ from the mean by 14.515652

scaled_variance: There are 843 records in the column with a mean value of 188.631079. The value ranges from 130 to 320. 25% of the records have values below 167, 50% below 179 and 75% below 217. The observations differ from the mean by 31.411004

scaled_variance.1: There are 844 records in the column with a mean value of 439.494076. The value ranges from 184 to 1018. 25% of the records have values below 318, 50% below 363.5 and 75% below 587. The observations differ from the mean by 176.666903

scaled_radius_of_gyration: There are 844 records in the column with a mean value of 174.709716. The value ranges from 109 to 268. 25% of the records have values below 149, 50% below 173.5 and 75% below 198. The observations differ from the mean by 32.584808

scaled_radius_of_gyration.1: There are 842 records in the column with a mean value of 72.447743. The value ranges from 59 to 135. 25% of the records have values below 67, 50% below 71.5 and 75% below 75. The observations differ from the mean by 7.48619

skewness_about: There are 840 records in the column with a mean value of 6.364286. The value ranges from 0 to 22. 25% of the records have values below 2, 50% below 6 and 75% below 9. The observations differ from the mean by 4.920649

skewness_about.1: There are 845 records in the column with a mean value of 12.602367. The value ranges from 0 to 41. 25% of the records have values below 5, 50% below 11 and 75% below 19. The observations differ from the mean by 8.936081

skewness_about.2: There are 845 records in the column with a mean value of 188.919527. The value ranges from 176 to 206. 25% of the records have values below 184, 50% below 188 and 75% below 193. The observations differ from the mean by 6.155809

hollows_ratio: There are 846 records in the column with a mean value of 195.632388. The value ranges from 181 to 211. 25% of the records have values below 190.25, 50% below 197 and 75% below 201. The observations differ from the mean by 7.438797
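The per-column figures above (count, mean, min, quartiles, max, and the spread about the mean, i.e. the standard deviation) all come from pandas' `describe()`. A small sketch on an illustrative column with one missing value:

```python
import pandas as pd

# Illustrative values (assumptions, not the real compactness column)
df = pd.DataFrame({"compactness": [85.0, 89.0, 93.0, 100.0, 104.0, None]})
summary = df.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```

Note that `describe()` excludes NaNs from the count, which is why several columns above report fewer than 846 records.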

To check for the presence of missing values

Here we can see that, apart from the attributes compactness, max.length_aspect_ratio, max.length_rectangularity, hollows_ratio and class, every other attribute has some missing values.
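The per-column missing-value count can be obtained as below (the stand-in values are assumptions):

```python
import numpy as np
import pandas as pd

# Stand-in frame; circularity has one gap, compactness has none
df = pd.DataFrame({
    "compactness": [93.0, 87.0, 100.0, 95.0],
    "circularity": [44.0, np.nan, 49.0, 40.0],
})
missing = df.isnull().sum()   # number of NaNs per column
print(missing)
```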

Handling of Missing values

Since the attributes are skewed and several contain outliers, we replace the missing values with the median of each column rather than the mean.
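A sketch of median imputation on one illustrative column (the values are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"radius_ratio": [141.0, np.nan, 167.0, 195.0]})
median = df["radius_ratio"].median()   # NaN ignored: median of 141, 167, 195
df["radius_ratio"] = df["radius_ratio"].fillna(median)
print(df["radius_ratio"].tolist())
```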

5-point summary of the numerical attributes (after filling in the missing values)

3. Understanding the attributes.

Univariate Analysis:

From the above plots it is clear that there are no outliers in compactness and it looks approximately normally distributed.

From the above plots it is clear that there are no outliers in circularity and it looks approximately normally distributed.

From the above plots it is clear that the attribute distance_circularity doesn't have any outliers. The distribution plot, however, shows two peaks along with right skewness, as the long tail is on the right side (mean > median).

From the plot it is clear that the attribute radius_ratio does have outliers. Also, there is right skewness, as the long tail is on the right side. The number of outliers can be calculated as:
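A generic IQR-based count, sketched on illustrative values (the numbers below are assumptions, not the real radius_ratio column); the same recipe applies to every "number of outliers" step in this section:

```python
import pandas as pd

s = pd.Series([104, 141, 150, 167, 180, 195, 333])  # illustrative values
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Tukey's rule: anything beyond 1.5 * IQR from the quartiles is an outlier
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(len(outliers))
```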

From the above box plot it is clear that the attribute pr.axis_aspect_ratio does have outliers. Apart from this, the distribution plot shows that the attribute is right skewed. The number of outliers can be calculated as:

From the above plots it is clear that max.length_aspect_ratio is positively skewed and does have outliers. The number of outliers can be calculated as:

From the above plots it is clear that there are no outliers in scatter_ratio. Also, from the distribution plot it is evident that there is right skewness, as its long tail is on the right side.

From the above box plot it is clear that elongatedness doesn't have any outliers. There are two peaks in the distribution plot, with a left skewness as its long tail is on the left side.

From the above box plot it is clear that pr.axis_rectangularity doesn't have any outliers. Also, from the distribution plot it is evident that there is right skewness.

From the above box plot it is clear that max.length_rectangularity doesn't have any outliers. Also, there is a slight right skewness in the attribute, as is evident from the plots.

From the above box plot it is clear that scaled_variance does have outliers. Also, the distribution plot indicates a fair amount of right skewness. The number of outliers can be calculated as:

From the above box plot it is clear that the attribute scaled_variance.1 does have outliers. Also, the distribution plot shows two peaks and a long tail on the right side, indicating positive skewness. The number of outliers can be calculated as:

From the above plot it is clear that there are no outliers in scaled_radius_of_gyration, and there is a slight right skewness as the long tail is on the right side.

From the above plot it is clear that there are a fair number of outliers in scaled_radius_of_gyration.1, and there is right skewness as the long tail is on the right side. The number of outliers can be calculated as:

From the above plot it is clear that skewness_about does have outliers. Also, there is right skewness because of the long tail. The number of outliers can be calculated as:

From the above plot it is evident that skewness_about.1 does have outliers and there is right skewness in it. The number of outliers can be calculated as:

From the above we can see that there are no outliers in skewness_about.2, along with a slight right skewness.

From the above plots it is clear that hollows_ratio doesn't have any outliers. Apart from this, the values are slightly left skewed.

There are 429 'car' records in the dataset, along with 218 'bus' and 199 'van' records.

Treating outliers in the dataset
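The notebook does not state which treatment was applied, so the sketch below is one common option (an assumption): capping values at the Tukey fences instead of dropping rows, which preserves the sample size. The values are illustrative:

```python
import pandas as pd

s = pd.Series([104, 141, 150, 167, 180, 195, 333])  # illustrative values
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower=lower, upper=upper)  # pull outliers back to the fences
print(capped.max())
```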

Bivariate Analysis

Here we will visualize how the different independent attributes vary with respect to the dependent attribute, 'class'.

Multivariate Analysis

This plot along with correlation matrix and heatmap will help us to analyze the relationship between the different attributes.

3. Split the dataset into training and test sets

Here, the independent variables are denoted by 'X' and the target is represented by 'y'.

We will also standardize the dataset:
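A self-contained sketch of the split-then-scale step. A synthetic dataset from `make_classification` stands in for the real 18-feature, 3-class vehicle data (an assumption); the key point is that the scaler is fitted on the training split only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 846 x 18 feature matrix and 3-class target
X, y = make_classification(n_samples=846, n_features=18, n_informative=8,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit only on the training split
X_test = scaler.transform(X_test)        # reuse the training statistics
```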

4. Support Vector Machine
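A minimal sketch of fitting the classifier. Synthetic data stands in for the real dataset (an assumption), so the accuracy printed here is not the 95.67% reported below, which comes from the actual vehicle data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=846, n_features=18, n_informative=8,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

svm = SVC()                           # RBF kernel by default
svm.fit(X_train, y_train)
accuracy = svm.score(X_test, y_test)  # fraction of correct predictions
print(f"accuracy: {accuracy:.4f}")
```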

Thus, the accuracy obtained from the model based on SVM is 95.67%.

From the above Classification Matrix it is clear:

5. Applying GridSearchCV
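A sketch of the hyperparameter search. The exact grid searched in the notebook is not shown, so the grid below is an assumption, and synthetic data again stands in for the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=1)

# Hypothetical grid over the two main SVC knobs
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
grid = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```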

Thus, the accuracy obtained from the model is 95.28%.

From the above Classification Matrix it is clear:

6. Principal Component Analysis

Calculating Eigen Values and Eigen Vectors
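The eigendecomposition underlying PCA can be sketched as below: centre the data, form the covariance matrix, and decompose it. Random data stands in for the scaled features (an assumption); with real data the sorted eigenvalues give the per-component explained variance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # stand-in for the scaled features
X = X - X.mean(axis=0)                    # PCA requires centred data
cov = np.cov(X, rowvar=False)             # 5 x 5 covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
order = np.argsort(eig_vals)[::-1]        # largest eigenvalue first
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
explained = np.cumsum(eig_vals) / eig_vals.sum()
print(explained)                          # cumulative explained variance ratio
```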

Thus, from the above plot it is clear that the first eight principal components explain 95% of the variance in the data. Going forward we will use these eight principal components in our model.

From the above pairplot it is clear that, after applying PCA, the attributes have become uncorrelated with one another, as every pairwise plot shows an unstructured cloud of data points.

7. SVM after applying PCA
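A sketch of the PCA-then-SVM flow using a scikit-learn Pipeline (the pipeline itself is an assumption about structure; the notebook may chain the steps manually). Synthetic data stands in for the real dataset, and `n_components=0.95` tells PCA to keep just enough components to explain 95% of the variance, mirroring the eight-component choice above:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=846, n_features=18, n_informative=8,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # keep components covering 95% variance
    ("svm", SVC()),
])
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(model.named_steps["pca"].n_components_, round(accuracy, 4))
```

Wrapping the steps in a Pipeline also ensures that scaling and PCA are refitted only on training data inside any cross-validation.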

Thus, the model has an accuracy of 93.3%.

From the above Classification Matrix it is clear:

Using GridSearchCV

Thus, the accuracy obtained from the above model is 93.7%.

From the above Classification Matrix it is clear:

8. Comparison of scores.

So, here we have tried four models: SVM, SVM with GridSearchCV, SVM with PCA, and SVM with PCA and GridSearchCV. The test set contained 59 cars, 133 buses and 62 vans, which the four models predicted with varying accuracy.

Support Vector Machine (SVM):

This model provided an accuracy of 95.67%. The model has a precision of 0.97 for car, 0.96 for bus and 0.95 for van, with a weighted accuracy of 0.96, and a recall of 0.97 for car, 0.97 for bus and 0.92 for van. Out of the actual 59 cars, 133 buses and 62 vans, it correctly predicted 57 cars, 129 buses and 57 vans.

Support Vector Machine (SVM) with GridSearchCV:

This model provided an accuracy of 95.28%. The model has a precision of 0.95 for car, 0.97 for bus and 0.92 for van, with a weighted accuracy of 0.95, and a recall of 0.98 for car, 0.95 for bus and 0.92 for van. Out of the actual 59 cars, 133 buses and 62 vans, it correctly predicted 58 cars, 127 buses and 57 vans.

Support Vector Machine (SVM) with Principal Component Analysis (PCA):

This model provided an accuracy of 93.3%. The model has a precision of 0.92 for car, 0.95 for bus and 0.91 for van, with a weighted accuracy of 0.93, and a recall of 0.98 for car, 0.95 for bus and 0.85 for van. Out of the actual 59 cars, 133 buses and 62 vans, it correctly predicted 58 cars, 126 buses and 53 vans.

Support Vector Machine (SVM) with Principal Component Analysis (PCA) and GridSearchCV:

This model provided an accuracy of 93.7%. The model has a precision of 0.93 for car, 0.95 for bus and 0.92 for van, with a weighted accuracy of 0.94, and a recall of 0.95 for car, 0.95 for bus and 0.9 for van. Out of the actual 59 cars, 133 buses and 62 vans, it correctly predicted 56 cars, 126 buses and 56 vans.

Thus, from this exercise we can come to the following conclusions:

Thus, if one has to select a model, an algorithm coupled with PCA might be the best choice: it predicts with an accuracy only slightly lower than the model trained on the raw data, while using only the most relevant components (with the variance threshold defined by the user).